date()
## [1] "Mon Nov 27 15:59:56 2023"
There seem to be many new things to learn, especially GitHub. As mentioned, I expect to improve at working with GitHub during the course, as well as at statistics on large data sets with R.

The Exercise1.Rmd file had a nice summary of handling data in R in chapters 1 & 2. Some of it I was already familiar with, but there was still a lot to learn; chapter 2 in particular had nice suggestions and tricks in its examples on how to handle data frames.

Using RStudio still seems quite fine; uploading files to GitHub is still a bit unclear, and I hope I will get more comfortable with it as the course goes on. Exercise set 1 was quite extensive and mostly familiar, but it's not bad to refresh your memory.
GitHub repository: https://github.com/venkatharina/IODS-project

GitHub diary: https://venkatharina.github.io/IODS-project/
date()
## [1] "Mon Nov 27 15:59:56 2023"
R-script: https://github.com/venkatharina/IODS-project/blob/master/Chapter2_Data_Wrangling.R
newfile <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt"), sep = ",", header = TRUE)
dim(newfile) #166 rows and 7 columns
## [1] 166 7
head(newfile)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
colnames(newfile)
## [1] "gender" "age" "attitude" "deep" "stra" "surf" "points"
#column names:"gender","age","attitude","deep","stra","surf","points"
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
summary(newfile)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
table(newfile$gender)
##
## F M
## 110 56
ggpairs(newfile, lower = list(combo = wrap("facethist", bins = 20)))
#individual mean values: age 25.5, attitude 3.14, deep 3.68, stra 3.12, surf 2.79, points 22.7
#there are 110 females and 56 males
#there are significant correlations between surf-attitude*, points-attitude***, surf-deep*** and surf-stra*
#visually, attitude, stra and surf seem approximately normally distributed
#choosing three explanatory variables: attitude, surf, stra
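The visual impression from ggpairs can be double-checked numerically. A small sketch (it simply re-reads the same file, so it runs on its own) listing each variable's correlation with points:

```r
# re-read the learning2014 data used above
lrn <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt"),
                  sep = ",", header = TRUE)
# correlation of each numeric explanatory variable with the outcome points
cors <- sapply(lrn[, c("age", "attitude", "deep", "stra", "surf")],
               function(x) cor(x, lrn$points))
cors
```

attitude stands out with clearly the strongest (positive) correlation with points, which matches the significance pattern in the fitted models.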
model = lm(points ~ attitude+surf+stra, data=newfile)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + surf + stra, data = newfile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
#attitude, with p-value 1.93e-08 ***, seems highly significant in the model
#whereas surf and stra are not significant in explaining points
#the model explains ~21% of the variance of points (multiple R-squared 0.207)
#removing the variables surf and stra gives model2
model2 = lm(points ~ attitude, data=newfile)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = newfile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
#attitude explains points significantly, but only ~19% of the variance (multiple R-squared 0.191)
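The point estimate can be complemented with a confidence interval; a sketch that refits model2 from the same file so it runs standalone:

```r
lrn <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt"),
                  sep = ",", header = TRUE)
model2 <- lm(points ~ attitude, data = lrn)
# 95% confidence interval for the attitude effect on points
confint(model2, "attitude", level = 0.95)
```

The interval stays well above zero, consistent with the very small p-value of attitude.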
#plotting models
par(mfrow = c(2,2))
plot(model)
par(mfrow = c(2,2))
plot(model2)
#generally all the plots look good (the linear model assumptions seem to hold)
#top left: residuals vs fitted; residuals scatter around zero with no clear pattern, so the errors look independent of the fitted values
#top right: Q-Q plot; the residuals fall close to the line, so they are approximately normally distributed
#bottom left: scale-location; the absolute residuals have similar magnitudes across fitted values (constant variance)
#bottom right: residuals vs leverage; no points combine a high residual with high leverage, so no single observation dominates the fit
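The residuals-vs-leverage reading can also be made numeric with Cook's distance. The 4/n cut-off below is just one common rule of thumb, not something from the exercise; the sketch refits model2 so it is self-contained:

```r
lrn <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt"),
                  sep = ",", header = TRUE)
model2 <- lm(points ~ attitude, data = lrn)
cd <- cooks.distance(model2)
# flag observations exceeding the common 4/n rule-of-thumb threshold
which(cd > 4 / length(cd))
```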
date()
## [1] "Mon Nov 27 16:00:00 2023"
R-Script: https://github.com/venkatharina/IODS-project/blob/Data/'create_alc.R
#reading table
a <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"), sep = ",", header = TRUE)
#looks the same as what I got from the data wrangling exercise
dim(a) #dimension of table: 370 observations and 35 variables
## [1] 370 35
str(a) #structure of table
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
colnames(a) #column names of table
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
#There are 35 variables: logical, integer, double and character vectors
#The data relates student alcohol consumption to performance in secondary education (high school)
#The attributes include student grades and demographic, social and school-related features; the data was collected using school reports and questionnaires
#To study the relation to low/high alcohol consumption I selected Fedu, studytime, romantic and absences
#I think high parental education and high studytime would lead to low alcohol consumption
#and romantic relationships and absences from school would lead to high alcohol consumption
##############################################################################
c.names = c("Fedu","studytime","romantic","absences")
par(mfrow = c(2,2))
for(ii in 1:length(c.names)){
barplot(table(a[,c.names[ii]]), main = c.names[ii], col="blue")
}
#almost all fathers have at least some education (Fedu), many a high level
#most students are in studytime category 2
#the majority are not in a romantic relationship
#the majority of absence counts fall between 0-5
#alcohol consumption vs. father's education (Fedu)
table(a$alc_use,a$Fedu)
##
## 0 1 2 3 4
## 1 2 25 42 37 34
## 1.5 0 15 15 18 15
## 2 0 9 18 15 14
## 2.5 0 7 9 13 12
## 3 0 8 10 8 6
## 3.5 0 5 5 1 6
## 4 0 1 1 5 2
## 4.5 0 1 1 0 1
## 5 0 2 4 0 3
plot(table(a$alc_use,a$Fedu))
#there does not seem to be a very clear pattern
#overall most students report alcohol use of 1, regardless of the father's education
#perhaps at the very high use value 5, the father's education tends to be 1-2
#alcohol consumption vs. studytime
table(a$alc_use,a$studytime)
##
## 1 2 3 4
## 1 21 72 30 17
## 1.5 18 31 10 4
## 2 17 25 12 2
## 2.5 10 25 5 1
## 3 12 18 1 1
## 3.5 11 4 2 0
## 4 4 4 0 1
## 4.5 0 3 0 0
## 5 5 3 0 1
plot(table(a$alc_use,a$studytime))
#students who study the least (category 1) seem more often to reach high alcohol use (3.5-5)
#students who study the most (category 4) seem to drink less overall
#the typical student is in studytime category 2 with alcohol use 1
#alcohol consumption vs. romantic
table(a$alc_use,a$romantic)
##
## no yes
## 1 98 42
## 1.5 42 21
## 2 33 23
## 2.5 29 12
## 3 25 7
## 3.5 13 4
## 4 6 3
## 4.5 1 2
## 5 4 5
plot(table(a$alc_use,a$romantic))
#overall the majority has alcohol use of only 1
#students in a romantic relationship seem to drink less,
#until we get to the very high values (4.5-5)
#alcohol consumption vs. absences
table(a$alc_use,a$absences)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 19 20 21 26 27 29
## 1 31 18 24 21 14 4 10 3 6 2 4 0 1 0 0 0 0 0 0 0 1 0 0 0
## 1.5 10 10 8 6 5 7 3 2 4 2 0 2 0 1 0 0 0 1 0 2 0 0 0 0
## 2 9 9 8 4 5 5 3 4 4 1 1 0 2 0 1 0 0 0 0 0 0 0 0 0
## 2.5 6 5 3 4 5 2 3 1 1 3 0 2 0 0 3 0 1 0 0 0 0 0 1 0
## 3 4 6 6 1 2 2 1 0 2 1 0 1 2 0 0 1 0 0 1 0 0 1 0 1
## 3.5 3 2 2 0 3 1 0 1 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0
## 4 0 0 3 1 0 0 1 1 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0
## 4.5 0 0 0 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
## 5 0 0 2 1 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0
##
## 44 45
## 1 0 1
## 1.5 0 0
## 2 0 0
## 2.5 1 0
## 3 0 0
## 3.5 0 0
## 4 0 0
## 4.5 0 0
## 5 0 0
plot(table(a$alc_use,a$absences))
#students who drink less (alcohol use 1) seem to have fewer absences
#students who drink more (4-5) seem to have more absences
#the majority still drinks little, as noted before
#earlier I predicted that high parental education and high studytime would lead to less drinking
#by eye, the father's education (Fedu) does not seem to have a large effect
#but high studytime does seem to go with lower alcohol consumption
#I also predicted that romantic relationships and a high number of absences lead to high alcohol consumption
#romantic relationships seem to work the opposite way, toward less alcohol
#low absences go with low alcohol consumption, high absences with high consumption
##############################################################################
#doing logistic regression to high_use (high alcohol consumption)
#where Fedu, studytime, absences and romantic status are predictors
summary(a$high_use)#high_use FALSE 259, TRUE 111
## Mode FALSE TRUE
## logical 259 111
table(as.numeric(a$high_use),a$high_use) #FALSE 0, TRUE 1
##
## FALSE TRUE
## 0 259 0
## 1 0 111
#high_use is the outcome variable --> modeling the TRUE answer
#note: without family = "binomial", glm() defaults to gaussian here
m=glm(high_use ~ Fedu + studytime + absences + romantic, data=a)
summary(m)
##
## Call:
## glm(formula = high_use ~ Fedu + studytime + absences + romantic,
## data = a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9654 -0.3109 -0.1951 0.5178 0.9755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4473666 0.0848465 5.273 2.31e-07 ***
## Fedu -0.0001055 0.0211342 -0.005 0.996018
## studytime -0.1023084 0.0272659 -3.752 0.000204 ***
## absences 0.0171176 0.0042214 4.055 6.13e-05 ***
## romanticyes -0.0474788 0.0493497 -0.962 0.336641
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1941708)
##
## Null deviance: 77.700 on 369 degrees of freedom
## Residual deviance: 70.872 on 365 degrees of freedom
## AIC: 450.54
##
## Number of Fisher Scoring iterations: 2
#Fedu and romanticyes do not seem significant in this model
##Fedu would decrease alcohol consumption, but not significantly
##romanticyes would decrease alcohol consumption, but not significantly
#studytime seems to decrease the probability of high alcohol consumption
#absences seem to increase the probability of high alcohol consumption
#I predicted that the father's education and studytime would decrease alcohol consumption
#and that romantic relationships and absences would increase it
#the father's education and studytime both point toward decreased consumption, but the education effect is not significant
#romantic relationships do not increase alcohol consumption; they decrease it, though not significantly
#absences do increase high alcohol consumption, as I predicted in the beginning
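The glm() call above fits a gaussian model (visible in the dispersion line of its output), i.e. a linear probability model rather than a logistic regression. A sketch of the same model as an actual logistic regression, with the coefficients exponentiated into odds ratios (it re-reads the data so it runs standalone):

```r
alc_raw <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"),
                      sep = ",", header = TRUE)
m_bin <- glm(high_use ~ Fedu + studytime + absences + romantic,
             data = alc_raw, family = "binomial")
# odds ratios: values below 1 decrease, above 1 increase the odds of high use
exp(coef(m_bin))
```

The signs agree with the gaussian fit above: studytime's odds ratio falls below 1 and absences' above 1.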
##############################################################################
#creating a new model with only the significant variables:
#high_use ~ studytime + absences
mm=glm(high_use ~ studytime + absences, data=a)
summary(mm)
##
## Call:
## glm(formula = high_use ~ studytime + absences, data = a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9803 -0.3126 -0.2292 0.5221 0.9611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.43618 0.06491 6.720 6.95e-11 ***
## studytime -0.10350 0.02720 -3.805 0.000166 ***
## absences 0.01669 0.00419 3.984 8.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1936026)
##
## Null deviance: 77.700 on 369 degrees of freedom
## Residual deviance: 71.052 on 367 degrees of freedom
## AIC: 447.48
##
## Number of Fisher Scoring iterations: 2
#creating new table with probabilities and predictions of model
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc <- mutate(a, probability = predict(mm, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
#true values of high alcohol use
actual <- summary(alc$high_use)
#predicted values of high alcohol use with studytime and absences
prediction <- summary(alc$prediction)
act_pred <- cbind(actual,prediction)
act_pred
## actual prediction
## Mode "logical" "logical"
## FALSE "259" "349"
## TRUE "111" "21"
#actual: 259 have low alcohol consumption and 111 high
#model: 349 predicted as low and only 21 as high
#plotting the actual alcohol consumption rates and probability rates:
plot(alc$alc_use)
plot(alc$probability) #most predicted probabilities stay below 0.5
#it seems the model underestimates high alcohol consumption
#2x2 crosstabular of actual values and predictions
table(alc$high_use,alc$prediction)
##
## FALSE TRUE
## FALSE 252 7
## TRUE 97 14
#the cross-tabulation shows quite many actual TRUEs predicted as FALSE (97)
#so the model does not fit very well
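The cross-tabulation can be condensed into accuracy and training error; the counts below are copied from that table, so they belong to this particular fit:

```r
# counts from the cross-tabulation above (rows = actual, cols = predicted)
cm <- matrix(c(252, 97, 7, 14), nrow = 2,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))
accuracy <- sum(diag(cm)) / sum(cm)   # (252 + 14) / 370
train_error <- 1 - accuracy           # share of misclassified students
c(accuracy = accuracy, train_error = train_error)
```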
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = mm, K = nrow(alc))
# average number of wrong predictions in the cross validation
cv$delta[1] #model gives wrong predictions: 0.2864865 = ~29%
## [1] 0.2864865
#guessing predictions with studytime=4, absences=0:
predict(mm, type = "response",
newdata = data.frame(studytime = 4, absences = 0))
## 1
## 0.02218847
#model prediction: 0.02218847 = ~2% probability of high use
#the students with studytime=4, absences=0 have alcohol use values 1.0, 2.0, 1.0 and 1.5 (scale 1-5)
(((1+2+1+1.5)/4)/5) #average and normalised value
## [1] 0.275
#observed: 0.275 (note: this normalised consumption average is not directly comparable to a probability of high use)
#still, the model agrees qualitatively here (real: FALSE, prediction: FALSE)
#guessing predictions with studytime=1, absences=27:
predict(mm, type = "response",
newdata = data.frame(studytime = 1, absences = 27))
## 1
## 0.7833464
#model prediction:0.7833464 = 78%
#when studytime=1, absences=27; alcohol consumption value 2.5 (scale:1-5)
(2.5/5) #average and normalised value
## [1] 0.5
#real: 0.5 = 50%
#the model is okay-ish with these settings (real: TRUE, prediction: TRUE)
#the model works to some extent, but there is a lot of error
#the model would need more predictors to be better
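One yardstick for the ~29% cross-validation error above is a trivial classifier that always guesses FALSE (the majority class); with the 259/111 split seen earlier:

```r
n_false <- 259  # actual low-consumption students
n_true  <- 111  # actual high-consumption students
# error of always predicting FALSE = share of TRUE cases
baseline_error <- n_true / (n_false + n_true)
baseline_error
```

The baseline errs 30% of the time, so the model's ~28.6% is only a marginal improvement, which supports the conclusion that more predictors are needed.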
##############################################################################
#10-fold cross-validation
cv10 <- cv.glm(data = alc, cost = loss_func, glmfit = mm, K = 10)
# average number of wrong predictions in the cross validation
cv10$delta[1] #model gives wrong predictions: 0.2864865 = ~29%
## [1] 0.2864865
#the model does not get better with 10-fold cross-validation
#this model has a higher prediction error than the one in Exercise 3 (26%)
#with different/more predictors a better model could perhaps be found
##############################################################################
#making a new logistic regression model with lots of predictors:
m2=glm(high_use ~ age+traveltime+studytime+famrel+freetime+goout+health+failures+absences
, data=a, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ age + traveltime + studytime + famrel +
## freetime + goout + health + failures + absences, family = "binomial",
## data = a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7921 -0.7468 -0.4853 0.6852 2.4889
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.85188 2.06322 -1.867 0.06191 .
## age 0.06714 0.11775 0.570 0.56855
## traveltime 0.37155 0.18580 2.000 0.04553 *
## studytime -0.46960 0.17578 -2.671 0.00755 **
## famrel -0.42247 0.14607 -2.892 0.00383 **
## freetime 0.18999 0.14557 1.305 0.19184
## goout 0.70070 0.12886 5.438 5.4e-08 ***
## health 0.16616 0.09720 1.709 0.08736 .
## failures 0.22758 0.23451 0.970 0.33181
## absences 0.06848 0.02202 3.110 0.00187 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 358.70 on 360 degrees of freedom
## AIC: 378.7
##
## Number of Fisher Scoring iterations: 4
#counting average number of wrong predictions
cv2 <- cv.glm(data = a, cost = loss_func, glmfit = m2, K = nrow(a))
cv2$delta[1] #0.2459459 average number of wrong predictions in the cross validation
## [1] 0.2459459
#adding predictions and probability to the table:
alc2 <- mutate(a, probability = predict(m2, type = "response"))
alc2 <- mutate(alc2, prediction = probability > 0.5)
#seeing actual values and predicted ones in one table
prediction2 <- summary(alc2$prediction)
act_pred2 <- cbind(actual,prediction2)
act_pred2
## actual prediction2
## Mode "logical" "logical"
## FALSE "259" "294"
## TRUE "111" "76"
#model2 (m2) is getting closer to the real data than model mm
#2x2 crosstabular of actual values and predictions:
table(alc2$high_use,alc2$prediction)
##
## FALSE TRUE
## FALSE 232 27
## TRUE 62 49
#there are more false positives (27 vs 7), but also far more correctly predicted TRUEs (49 vs 14)
#overall the model seems better
#adjusting the model according to significant predictors:
m3=glm(high_use ~ traveltime+studytime+famrel+goout+health+absences
, data=a, family = "binomial")
summary(m3)
##
## Call:
## glm(formula = high_use ~ traveltime + studytime + famrel + goout +
## health + absences, family = "binomial", data = a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8162 -0.7462 -0.4806 0.7134 2.5082
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.35767 0.84403 -2.793 0.00522 **
## traveltime 0.38673 0.18207 2.124 0.03366 *
## studytime -0.52063 0.17144 -3.037 0.00239 **
## famrel -0.40296 0.14244 -2.829 0.00467 **
## goout 0.77461 0.12334 6.280 3.38e-10 ***
## health 0.17958 0.09532 1.884 0.05957 .
## absences 0.06861 0.02191 3.132 0.00174 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 362.02 on 363 degrees of freedom
## AIC: 376.02
##
## Number of Fisher Scoring iterations: 4
#counting average number of wrong predictions:
cv3 <- cv.glm(data = a, cost = loss_func, glmfit = m3, K = nrow(a))
cv33 <- cv3$delta[1] #0.2378378 average number of wrong predictions in the cross validation
#adding predictions and probability to the table:
alc3 <- mutate(a, probability = predict(m3, type = "response"))
alc3 <- mutate(alc3, prediction = probability > 0.5)
#seeing actual values and predicted ones in one table:
prediction3 <- summary(alc3$prediction)
act_pred3 <- cbind(actual,prediction3)
act_pred3
## actual prediction3
## Mode "logical" "logical"
## FALSE "259" "293"
## TRUE "111" "77"
#model3 (m3) is getting even closer to the real data (predictability better)
#2x2 crosstabular of actual values and predictions:
table(alc3$high_use,alc3$prediction)
##
## FALSE TRUE
## FALSE 232 27
## TRUE 61 50
#again more false positives than with model mm, but far more correctly predicted TRUEs (50 vs 14)
#overall the model seems better
#in both models, with more predictors/more significant predictors the prediction improves (smaller error and better fit in the cross-tabulations)
#I did not know how to make a graph showing both training and testing errors by the number of predictors in the model
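One possible sketch of such a graph: fit a nested sequence of models (the predictor order below is my own choice, roughly by significance in m2, not from the exercise) and record the training error and 10-fold CV error for each model size:

```r
library(boot)  # cv.glm

alc_raw <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"),
                      sep = ",", header = TRUE)
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# hypothetical nested predictor sequence, ordered roughly by significance in m2
preds <- c("goout", "absences", "studytime", "famrel", "traveltime", "health")
errs <- data.frame(n_predictors = seq_along(preds), train = NA, test = NA)
set.seed(42)  # for reproducible CV folds
for (i in seq_along(preds)) {
  m_i <- glm(reformulate(preds[1:i], response = "high_use"),
             data = alc_raw, family = "binomial")
  errs$train[i] <- loss_func(alc_raw$high_use, predict(m_i, type = "response"))
  errs$test[i]  <- cv.glm(data = alc_raw, cost = loss_func,
                          glmfit = m_i, K = 10)$delta[1]
}
# training error (solid) and CV error (dashed) by number of predictors
plot(errs$n_predictors, errs$train, type = "b", ylim = range(errs[, -1]),
     xlab = "number of predictors", ylab = "prediction error")
lines(errs$n_predictors, errs$test, type = "b", lty = 2)
legend("topright", c("training error", "10-fold CV error"), lty = 1:2)
```

Training error should decrease as predictors are added, while CV error can flatten or rise when the extra predictors stop helping.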
date()
## [1] "Mon Nov 27 16:00:03 2023"
R-Script: https://github.com/venkatharina/IODS-project/blob/Data/'create_alc.R
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
#loading the data
data("Boston")
#exploring the dataset
dim(Boston) #506 rows and 14 variables
## [1] 506 14
str(Boston) #numeric and integer vectors
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#The data comes from a study of the methodological problems associated with using housing market data to measure the willingness to pay for clean air
#Using a hedonic housing price model and data for the Boston metropolitan area, quantitative estimates of the willingness to pay for air quality improvements were generated
#Marginal air pollution damages (as revealed in the housing market) were found to increase with the level of air pollution and with household income
#graphical overview and summary of the data:
pairs(Boston) #(plot() on a data frame gives the same scatterplot matrix)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle")
#lots of variables; some seem uncorrelated, others correlate negatively/positively with each other
#scaling data:
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
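The summary confirms every mean is now 0. scale() is simply the column-wise transformation (x - mean) / sd; a small self-contained check on one column:

```r
library(MASS)  # Boston data
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
# manual standardisation of one column should match scale()'s output
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual_crim, boston_scaled$crim)
```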
#all variables were standardised: mean 0, standard deviation 1
#`crim` (per capita crime rate by town)
#cutting the variable to put the high, low and middle crime rates into their own categories
boston_scaled$crim <- as.numeric(boston_scaled$crim)
#using the quantiles as the break points (bins)
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#creating categorical variable
crime <- cut(boston_scaled$crim, breaks =bins, include.lowest = TRUE)
#removing original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
#adding the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
#creating testing and training data:
n <- nrow(Boston)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
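A note on reproducibility: sample() draws a different split on every knit, so the numbers further down change between runs. A seeded sketch of the same split (the seed value 2023 is my own pick, not from the exercise):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
# recreate the categorical crime variable as above
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE)
boston_scaled$crim <- NULL
set.seed(2023)  # fixes the random split between knits
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
c(train = nrow(train), test = nrow(test))
```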
#doing linear discriminant analysis
lda.fit = lda(crime ~ ., data=train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 0.2326733 0.2549505 0.2500000 0.2623762
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.411] 0.9580386 -0.9466004 -0.188560978 -0.8677103 0.4469818
## (-0.411,-0.39] -0.0613828 -0.2956176 -0.004759149 -0.5704619 -0.1290783
## (-0.39,0.00739] -0.3751653 0.2179049 0.156464027 0.3576492 0.1005847
## (0.00739,9.92] -0.4872402 1.0149946 -0.049474337 1.0336472 -0.4176064
## age dis rad tax ptratio
## [-0.419,-0.411] -0.9201614 0.9093828 -0.7069722 -0.7212435 -0.3961584
## (-0.411,-0.39] -0.3268389 0.3959239 -0.5403247 -0.4558603 -0.1117540
## (-0.39,0.00739] 0.3839330 -0.3798954 -0.3905813 -0.2826861 -0.2483720
## (0.00739,9.92] 0.8078213 -0.8507681 1.6596029 1.5294129 0.8057784
## black lstat medv
## [-0.419,-0.411] 0.38073394 -0.77075666 0.522037827
## (-0.411,-0.39] 0.31943169 -0.14671816 -0.001666894
## (-0.39,0.00739] 0.09724369 0.05042739 0.178690084
## (0.00739,9.92] -0.74216675 0.82342720 -0.635123495
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.14234923 0.77355752 -1.00115887
## indus 0.05873694 -0.42487823 0.19906310
## chas -0.07745778 -0.03174009 0.21499570
## nox 0.28236809 -0.65787375 -1.56879344
## rm -0.08522737 -0.10099287 -0.20396763
## age 0.30411184 -0.25861332 0.09979823
## dis -0.13530708 -0.30376430 0.27204418
## rad 3.23813607 0.80968693 0.05291586
## tax -0.05735095 0.11169132 0.68322897
## ptratio 0.12113357 0.14802566 -0.45004883
## black -0.15959677 0.01577097 0.14805267
## lstat 0.16404407 -0.31744726 0.21562969
## medv 0.17056246 -0.41179431 -0.31072628
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9498 0.0366 0.0136
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
#plotting lda results biplot)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
#saving the crime categories from the test set
correct_classes <- test$crime
#then removing the categorical crime variable from the test dataset
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## [-0.419,-0.411] 18 13 2 0
## (-0.411,-0.39] 2 19 2 0
## (-0.39,0.00739] 0 11 14 0
## (0.00739,9.92] 0 0 1 20
#predictions are close to perfect only in the highest crime class (0.00739,9.92]
#mostly the predictions seem okay-ish (the majority falls in the right class), but it could be better
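The cross-tabulation can be summarised as overall accuracy; the counts below are copied from that table, so they depend on this particular random split:

```r
# diagonal = correctly classified test observations, from the table above
correct <- 18 + 19 + 14 + 20
total   <- (18 + 13 + 2 + 0) + (2 + 19 + 2 + 0) +
           (0 + 11 + 14 + 0) + (0 + 0 + 1 + 20)
accuracy <- correct / total
round(accuracy, 3)
```

About 70% of the test set lands in the right crime class, consistent with "okay-ish but could be better".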
#reloading the data
data("Boston")
#scaling data:
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$dis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2658 -0.8049 -0.2790 0.0000 0.6617 3.9566
#calculating distances between the observations
distance <- dist(boston_scaled)
summary(distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#trying to identify the right number of clusters
set.seed(140)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss}) #note: this uses the unscaled Boston; boston_scaled would be more consistent with the clustering below
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#the right cluster number for k-means seems to be 6 (where the drop in total WCSS levels off)
km <- kmeans(boston_scaled, centers = 6)
# plot the dataset with clusters
pairs(boston_scaled, col = km$cluster)
############ BONUS ############
#reloading data:
data("Boston")
#scaling data:
boston_scaled <- as.data.frame(scale(Boston))
#k-means clustering with 8 centers
km3 <- kmeans(boston_scaled, centers = 8)
boston_scaled$kmcluster <- as.numeric(km3$cluster)
#fitting an LDA model with the k-means clusters as the target
lda.fit2 <- lda(kmcluster ~ ., data = boston_scaled)
lda.fit2
## Call:
## lda(kmcluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3 4 5 6 7
## 0.21343874 0.04150198 0.21541502 0.06521739 0.10869565 0.15415020 0.10276680
## 8
## 0.09881423
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3768106 -0.425505051 -0.3701112 0.09221725 -0.2286148 -0.35000772
## 2 3.2082952 -0.487240187 1.0149946 -0.27232907 1.0998477 -1.45566793
## 3 -0.4060502 -0.005364112 -0.6583486 -0.27232907 -0.8405622 -0.04480167
## 4 1.1193656 -0.487240187 1.0149946 -0.27232907 0.9950574 -0.19450805
## 5 -0.4146327 2.524683754 -1.2040537 -0.20074543 -1.2034679 0.73835915
## 6 0.4620310 -0.487240187 1.0149946 0.13147608 1.0021383 -0.15800782
## 7 -0.2727474 -0.487240187 1.5613334 0.10623826 1.1631724 -0.63230595
## 8 -0.3681800 -0.053323566 -0.7442735 0.59383298 -0.2416605 1.68533550
## age dis rad tax ptratio black
## 1 0.3376923 -0.1286995 -0.5799077 -0.53244743 0.22283672 0.2729966
## 2 1.0064301 -1.0710604 1.6596029 1.52941294 0.80577843 -0.1691195
## 3 -1.0244892 0.8700429 -0.5720054 -0.70802741 -0.07989337 0.3722489
## 4 0.7406815 -0.8461740 1.6596029 1.52941294 0.80577843 -3.2850509
## 5 -1.4205771 1.6272606 -0.6832698 -0.53843478 -0.89403340 0.3540831
## 6 0.6920432 -0.7470446 1.6596029 1.52941294 0.80577843 0.1640227
## 7 0.9324774 -0.9083143 -0.6130366 0.03111252 -0.35520300 -0.1409865
## 8 0.1056916 -0.2903328 -0.4926242 -0.78414273 -1.08156698 0.3392477
## lstat medv
## 1 0.1641254 -0.25817616
## 2 2.0102765 -1.39479170
## 3 -0.5625097 0.09778119
## 4 1.1195132 -1.03254705
## 5 -0.9678176 0.83523455
## 6 0.3945960 -0.31539874
## 7 0.6799267 -0.51668841
## 8 -0.9695286 1.72241105
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## crim 0.19364384 0.126100236 -0.20191652 -0.376580103 -0.838531764
## zn -0.05343615 1.094813039 1.50799254 -1.133131460 0.197678623
## indus 0.55399704 -1.280702654 1.13241037 -1.197258173 0.124603383
## chas -0.03302284 -0.024772810 -0.14210875 0.080434989 0.122406528
## nox 0.05435103 -0.424948658 0.26397352 -0.771840326 0.285880375
## rm -0.04813200 0.181941956 0.09971698 0.277384152 0.689446855
## age 0.14103637 -0.651476207 -0.09019746 0.008315478 0.502754473
## dis -0.33219936 0.286780273 0.03476554 -0.281273725 -0.252480041
## rad 5.93891512 1.865585385 -1.41594980 0.744695993 0.427251550
## tax -0.05791010 0.519204288 0.49438996 -0.580198607 0.087932891
## ptratio 0.21221433 -0.011384841 0.01954111 0.048612579 -0.116143664
## black -0.19991944 0.269843607 -1.52326403 -1.543950260 -0.002263929
## lstat 0.07512642 -0.006494829 0.08239517 0.043407190 -0.209298121
## medv -0.08095390 0.223086304 -0.03618324 -0.137450228 0.487212102
## LD6 LD7
## crim -1.055793974 0.76410951
## zn -0.472879553 -0.75684606
## indus 0.908040391 1.05671771
## chas -0.139637868 -0.18984158
## nox 0.186010320 0.32881295
## rm -0.077931268 0.31058189
## age -0.573790467 -0.91403286
## dis 0.721069945 0.66003474
## rad 0.851480550 0.28819620
## tax -0.245047274 -0.77133610
## ptratio -0.008931086 -0.50612180
## black 0.001693988 -0.02201946
## lstat -0.741261471 0.23456713
## medv -0.594939439 0.54523880
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5 LD6 LD7
## 0.7632 0.1051 0.0506 0.0403 0.0205 0.0141 0.0062
#plotting
plot(lda.fit2, dimen = 2)
lda.arrows(lda.fit2, myscale = 1)
############ SUPERBONUS ############
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
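The dimensions above explain why the product works: a (404 x 13) predictor matrix times the (13 x 3) scaling matrix gives each observation one score on each of the three discriminants. The same logic in miniature, with toy numbers instead of the actual data:

```r
# 5 observations x 13 predictors, times 13 loadings x 3 discriminants
set.seed(1)
X <- matrix(rnorm(5 * 13), nrow = 5)
S <- matrix(rnorm(13 * 3), nrow = 13)
scores <- X %*% S   # (5 x 13) %*% (13 x 3) -> (5 x 3)
dim(scores)
## [1] 5 3
```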
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
#install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
#note: color must be the unquoted variable, not the string 'train$crime'
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime, colors = c('red','green','blue','violet'))
(More chapters will be added in the same way as the course proceeds!)